To install the Natural Language Toolkit (NLTK), type the following in a terminal:

sudo pip install -U nltk

You'll need the gensim package too:

pip install -U gensim

This notebook also uses the textblob and fuzzy packages:

sudo pip install textblob

sudo pip install fuzzy


In [1]:
import nltk 
#nltk.download()  # uncomment and run once to fetch corpora such as wordnet and stopwords


Text processing steps:

  • Noise Removal
  • Lexicon Normalization
  • Object Standardization

Text processing pipeline

  • 1) Raw Text
  • 2) (Noisy Entities Removal) Stopwords, URLs, punctuation, mentions, etc.
  • 3) (Word Normalization) Tokenization, Lemmatization, Stemming
  • 4) (Word Standardization) Regular expressions, Lookup tables
  • 5) Cleaned text

2.1) Noise Removal: stopwords, URLs, punctuation, mentions, etc.


In [4]:
# Sample code to remove noisy words from a text

noise_list = ["is", "a", "this", "..."] 
def _remove_noise(input_text):
    words = input_text.split() 
    noise_free_words = [word for word in words if word not in noise_list] 
    noise_free_text = " ".join(noise_free_words) 
    return noise_free_text

In [5]:
_remove_noise("this is a sample text")


Out[5]:
'sample text'
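
The hard-coded noise_list above is only for illustration. A more realistic sketch uses NLTK's built-in English stopword list (assuming the stopwords corpus has been fetched with nltk.download('stopwords')):

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def _remove_stopwords(input_text):
    # keep only the words that are not in NLTK's English stopword list
    words = input_text.split()
    return " ".join(word for word in words if word.lower() not in stop_words)

_remove_stopwords("this is a sample text")  # -> 'sample text'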

In [6]:
# Sample code to remove a regex pattern 
import re

In [7]:
def _remove_regex(input_text, regex_pattern):
    # strip every match of the pattern from the text;
    # str.replace avoids re-interpreting the matched text as a regex
    for match in re.finditer(regex_pattern, input_text):
        input_text = input_text.replace(match.group(), '')
    return input_text

In [8]:
regex_pattern = r"#[\w]*"  # raw string so the escape sequence reaches the regex engine intact

In [9]:
_remove_regex("remove this #hashtag from analytics vidhya", regex_pattern)


Out[9]:
'remove this  from analytics vidhya'
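
The same helper can strip the other noisy entities named in the pipeline, e.g. URLs and @-mentions. A minimal sketch (the patterns below are illustrative, not exhaustive):

url_pattern = r"https?://\S+"
mention_pattern = r"@[\w]*"

text = "read this @user http://example.com/post now"
text = _remove_regex(text, url_pattern)      # drops the URL
text = _remove_regex(text, mention_pattern)  # drops the mention
# leftover double spaces can be collapsed with " ".join(text.split())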

2.2) Lexicon Normalization

For example, “play”, “player”, “played”, “plays” and “playing” are different variations of the word “play”. Though they differ in form, contextually they are all similar. This step converts all such variants of a word into their normalized form (also known as the lemma).

  • Stemming: Stemming is a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.

  • Lemmatization: Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word; it makes use of vocabulary (the dictionary meaning of words) and morphological analysis (word structure and grammar relations).


In [11]:
from nltk.stem.wordnet import WordNetLemmatizer 
from nltk.stem.porter import PorterStemmer

In [12]:
lem = WordNetLemmatizer()
stem = PorterStemmer()

In [13]:
word = "multiplying"

In [14]:
lem.lemmatize(word, "v")


Out[14]:
'multiply'

In [15]:
stem.stem(word)


Out[15]:
'multipli'
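
Note the difference: the lemmatizer returns the dictionary form 'multiply', while the rule-based stemmer merely strips the suffix and leaves the non-word 'multipli'.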

2.3) Object Standardization

Text data often contains words or phrases that are not present in standard lexical dictionaries, and such pieces are not recognized by search engines and models.

Some examples are acronyms, hashtags with attached words, and colloquial slang. With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed. The code below uses a dictionary lookup to replace social media slang in a text.


In [17]:
lookup_dict = {'rt':'Retweet', 'dm':'direct message', "awsm" : "awesome", "luv" :"love"}

In [24]:
def _lookup_words(input_text):
    words = input_text.split() 
    new_words = [] 
    for word in words:
        # swap the word for its expansion when it appears in the lookup dictionary
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    return new_text

In [25]:
_lookup_words("RT this is a retweeted tweet by Shivam Bansal")


Out[25]:
'Retweet this is a retweeted tweet by Shivam Bansal'

3) Text to Features (Feature Engineering on Text Data)


In [26]:
from nltk import word_tokenize, pos_tag

In [27]:
text = "I am learning Natural Language Processing on Analytics Vidhya"

In [28]:
tokens = word_tokenize(text)


In [30]:
pos_tag(tokens)


Out[30]:
[('I', 'PRP'),
 ('am', 'VBP'),
 ('learning', 'VBG'),
 ('Natural', 'NNP'),
 ('Language', 'NNP'),
 ('Processing', 'NNP'),
 ('on', 'IN'),
 ('Analytics', 'NNP'),
 ('Vidhya', 'NNP')]
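
These are Penn Treebank tags: PRP is a personal pronoun, VBP a non-3rd-person present-tense verb, VBG a gerund/present participle, NNP a proper noun, and IN a preposition.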


In [32]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father." 
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."

In [33]:
doc_complete = [doc1, doc2, doc3]

In [34]:
doc_clean = [doc.split() for doc in doc_complete]

In [43]:
import gensim 

from gensim import corpora

In [44]:
# Creating the term dictionary of our corpus, where every unique term is assigned an index.  
dictionary = corpora.Dictionary(doc_clean)

In [45]:
# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above. 
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

In [46]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

In [47]:
# Running and Training LDA model on the document term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

In [48]:
# Results 
print(ldamodel.print_topics())


[(0, '0.029*"sister" + 0.029*"my" + 0.029*"My" + 0.029*"to" + 0.029*"stress" + 0.029*"pressure." + 0.029*"increased" + 0.029*"that" + 0.029*"cause" + 0.029*"and"'), (1, '0.089*"to" + 0.051*"My" + 0.051*"my" + 0.051*"sister" + 0.051*"consume." + 0.051*"sugar," + 0.051*"Sugar" + 0.051*"father." + 0.051*"bad" + 0.051*"but"'), (2, '0.064*"driving" + 0.037*"around" + 0.037*"dance" + 0.037*"practice." + 0.037*"time" + 0.037*"a" + 0.037*"lot" + 0.037*"of" + 0.037*"father" + 0.037*"spends"')]
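
The topics above are dominated by stopwords and punctuation because the documents were only split, never cleaned. A minimal pre-cleaning sketch (assuming the NLTK stopwords corpus is available) to run before rebuilding the dictionary and document-term matrix:

from nltk.corpus import stopwords
import string

stop = set(stopwords.words('english'))
exclude = set(string.punctuation)

def clean(doc):
    # lowercase, drop punctuation characters, then drop stopwords
    no_punct = ''.join(ch for ch in doc.lower() if ch not in exclude)
    return [word for word in no_punct.split() if word not in stop]

doc_clean = [clean(doc) for doc in doc_complete]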

In [49]:
def generate_ngrams(text, n):
    words = text.split()
    output = []  
    for i in range(len(words)-n+1):
        output.append(words[i:i+n])
    return output

In [50]:
generate_ngrams('this is a sample text', 2)


Out[50]:
[['this', 'is'], ['is', 'a'], ['a', 'sample'], ['sample', 'text']]
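
NLTK ships an equivalent utility that yields tuples rather than lists; a quick check:

from nltk import ngrams
list(ngrams('this is a sample text'.split(), 2))
# [('this', 'is'), ('is', 'a'), ('a', 'sample'), ('sample', 'text')]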

In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [52]:
obj = TfidfVectorizer()

In [53]:
corpus = ['This is sample document.', 'another random document.', 'third sample document text']

In [54]:
X = obj.fit_transform(corpus)

In [55]:
print (X)


  (0, 7)	0.58448290102
  (0, 2)	0.58448290102
  (0, 4)	0.444514311537
  (0, 1)	0.345205016865
  (1, 1)	0.385371627466
  (1, 0)	0.652490884513
  (1, 3)	0.652490884513
  (2, 4)	0.444514311537
  (2, 1)	0.345205016865
  (2, 6)	0.58448290102
  (2, 5)	0.58448290102
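
Each line above is a (document, feature-index) pair with its TF-IDF weight. The index-to-term mapping is stored in the fitted vectorizer's vocabulary (dictionary ordering may vary when printed):

print(obj.vocabulary_)
# {'another': 0, 'document': 1, 'is': 2, 'random': 3, 'sample': 4, 'text': 5, 'third': 6, 'this': 7}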

In [56]:
from gensim.models import Word2Vec

In [57]:
sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics'],['machine', 'learning'], ['deep', 'learning']]

In [58]:
# train the model on your corpus  
model = Word2Vec(sentences, min_count = 1)


WARNING:gensim.models.word2vec:under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay

In [60]:
print (model.similarity('data', 'science'))


0.098663955684

In [61]:
print (model['learning'])


[  1.08607556e-03   4.62277047e-03   2.58435309e-03  -4.26230673e-03
   4.32864809e-03  -4.04330960e-04  -1.75475678e-03  -3.26948427e-03
  -4.35007038e-03  -9.38271580e-04  -2.72817072e-03   3.39866313e-03
  -4.28924803e-03   2.53001228e-03  -1.47502718e-03  -4.54866001e-03
  -1.19755440e-03   1.36745919e-03  -4.99364780e-03   4.39920370e-03
  -8.78889696e-04   2.55907397e-03  -4.47233114e-03  -2.98093841e-03
  -3.04079871e-03   3.77006200e-03  -5.06169279e-04  -1.58164476e-04
  -4.97120013e-03   4.11883416e-03  -1.16382574e-03   3.81740881e-03
  -1.49161392e-03  -4.03360883e-03   3.22279660e-03  -2.94679590e-03
   3.02863074e-03  -3.42801865e-03  -3.52651492e-04  -3.85172991e-03
  -2.11770809e-03  -4.80807154e-03   1.37284151e-04   2.76812771e-03
  -3.94002767e-03  -3.65456362e-04  -1.79178803e-03  -3.28000169e-03
  -1.05990539e-03   2.68064812e-03   8.77506754e-05  -2.99095735e-03
   1.89492374e-03   4.13068919e-05  -5.89237607e-04  -6.49927882e-04
   2.57901847e-03  -7.03117403e-04   4.20667045e-03  -2.76946439e-03
   2.95516499e-03  -2.25505629e-03  -1.83734728e-03   2.27217446e-03
   4.41791257e-03   2.84861424e-04   3.51103576e-04   3.54659790e-03
  -2.52496474e-03   4.43121325e-03  -1.20758770e-04   8.45276692e-04
   4.18963004e-03  -2.65925453e-04   4.44951747e-03  -3.20018292e-03
   3.99162294e-03   3.94796021e-04  -1.47228176e-03   1.47388573e-03
   1.93689705e-03   2.78494786e-04  -3.44451889e-03   4.34742076e-03
   3.53631843e-03   1.57816766e-03  -3.53800104e-04   1.87509082e-04
  -2.43617431e-03   1.70787866e-03  -4.31234203e-03  -4.08355193e-03
  -4.45934990e-03   2.58488394e-03  -2.31626094e-03  -1.79338094e-03
   3.77377262e-03  -4.37695207e-03   9.26580222e-04  -3.43537773e-03]
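
Individual vectors are rarely inspected by eye; nearest-neighbour queries are the more common use. A quick check (with a corpus this tiny the similarities are essentially noise, and in newer gensim releases these calls live under model.wv):

print(model.most_similar('data'))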

In [64]:
from textblob.classifiers import NaiveBayesClassifier as NBC

In [65]:
from textblob import TextBlob

In [66]:
training_corpus = [
                   ('I am exhausted of this work.', 'Class_B'),
                   ("I can't cooperate with this", 'Class_B'),
                   ('He is my badest enemy!', 'Class_B'),
                   ('My management is poor.', 'Class_B'),
                   ('I love this burger.', 'Class_A'),
                   ('This is an brilliant place!', 'Class_A'),
                   ('I feel very good about these dates.', 'Class_A'),
                   ('This is my best work.', 'Class_A'),
                   ("What an awesome view", 'Class_A'),
                   ('I do not like this dish', 'Class_B')]

In [67]:
test_corpus = [
                ("I am not feeling well today.", 'Class_B'), 
                ("I feel brilliant!", 'Class_A'), 
                ('Gary is a friend of mine.', 'Class_A'), 
                ("I can't believe I'm doing this.", 'Class_B'), 
                ('The date was good.', 'Class_A'),
                ('I do not enjoy my job', 'Class_B')]

In [68]:
model = NBC(training_corpus)

In [69]:
print((model.classify("Their codes are amazing.")))


Class_A

In [70]:
print((model.classify("I don't like their computer.")))


Class_B

In [71]:
print((model.accuracy(test_corpus)))


0.8333333333333334


In [80]:
#import TfidfVectorizer from sklearn.feature_extraction.text
from sklearn.feature_extraction.text import TfidfVectorizer

In [93]:
#import classification_report
from sklearn import metrics
from sklearn.metrics import classification_report

In [76]:
from sklearn import svm


In [83]:
# preparing data for SVM model (using the same training_corpus, test_corpus from naive bayes example)
train_data = []
train_labels = []
for row in training_corpus:
    train_data.append(row[0])
    train_labels.append(row[1])

test_data = [] 
test_labels = [] 
for row in test_corpus:
    test_data.append(row[0]) 
    test_labels.append(row[1])

In [84]:
# Create feature vectors 
vectorizer = TfidfVectorizer(min_df=4, max_df=0.9)

In [85]:
# Train the feature vectors
train_vectors = vectorizer.fit_transform(train_data)

In [86]:
# Apply model on test data 
test_vectors = vectorizer.transform(test_data)

In [87]:
# Perform classification with SVM, kernel=linear 
model = svm.SVC(kernel='linear')

In [88]:
model.fit(train_vectors, train_labels)


Out[88]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [89]:
prediction = model.predict(test_vectors)

In [94]:
print ((classification_report(test_labels, prediction)))


             precision    recall  f1-score   support

    Class_A       0.50      0.67      0.57         3
    Class_B       0.50      0.33      0.40         3

avg / total       0.50      0.50      0.49         6
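
Text matching: the Levenshtein (edit) distance between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. The implementation below builds the distance row by row.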


In [95]:
def levenshtein(s1, s2):
    # ensure s1 is the shorter string
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    # distances holds edit distances of s1 prefixes against the part of s2 processed so far
    distances = range(len(s1) + 1)
    for index2, char2 in enumerate(s2):
        newDistances = [index2 + 1]
        for index1, char1 in enumerate(s1):
            if char1 == char2:
                newDistances.append(distances[index1])
            else:
                newDistances.append(1 + min((distances[index1],
                                             distances[index1 + 1],
                                             newDistances[-1])))
        distances = newDistances
    return distances[-1]

In [96]:
print(levenshtein("analyze","analyse"))


1

Phonetic matching: Soundex codes a word by how it sounds, so words that are spelled differently but pronounced alike (e.g. "ankit" and "aunkit") receive the same four-character code. This requires the fuzzy package installed at the top of this notebook.


In [100]:
import fuzzy

In [98]:
soundex = fuzzy.Soundex(4)

In [99]:
print (soundex('ankit'))


A523

In [ ]:
print (soundex('aunkit'))

In [ ]:
import math

In [ ]:
from collections import Counter
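
Cosine similarity treats each text as a term-frequency vector and scores the pair by the cosine of the angle between the vectors: the dot product divided by the product of the magnitudes. Identical texts score 1; texts with no words in common score 0.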

In [ ]:
def get_cosine(vec1, vec2):
    # dot product over the terms common to both vectors
    common = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in common])

    # product of the two vector magnitudes
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

In [ ]:
def text_to_vector(text): 
    words = text.split() 
    return Counter(words)

In [101]:
text1 = 'This is an article on analytics vidhya' 
text2 = 'article on analytics vidhya is about natural language processing'

In [ ]:
vector1 = text_to_vector(text1)

In [ ]:
vector2 = text_to_vector(text2)

In [ ]:
cosine = get_cosine(vector1, vector2)
print(cosine)